Attention-based neural encoder-decoder frameworks have been widely adopted for image captioning. Most methods force visual attention to be active for every generated word. However, the decoder likely requires little to no visual information from the image to predict non-visual words such as "the" and "of". Other words that may seem visual can often be predicted reliably just from the language model, e.g., "sign" after "behind a red stop" or "phone" following "talking on a cell". In this paper, we propose a novel adaptive attention model with a visual sentinel. At each time step, our model decides whether to attend to the image (and if so, to which regions) or to the visual sentinel. The model decides whether to attend to the image and where, in order to extract meaningful information for sequential word generation. We test our method on the COCO image captioning 2015 challenge dataset and Flickr30K. Our approach sets the new state-of-the-art by a significant margin.
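The abstract only names the mechanism, so the following is a minimal sketch of how such a sentinel-gated attention step could look in PyTorch. The class name `AdaptiveAttention`, the layer names, and the assumption that the spatial image features and the sentinel vector share one dimensionality are illustrative choices, not details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdaptiveAttention(nn.Module):
    """Sketch of adaptive attention with a visual sentinel.

    At each decoding step, the context vector is a mixture of attended
    spatial image features and a "visual sentinel" distilled from the
    language-model state, so non-visual words can lean on the sentinel
    instead of the image.
    """

    def __init__(self, dim: int, attn_dim: int):
        super().__init__()
        # Assumes image features, sentinel, and hidden state all have size `dim`.
        self.feat_proj = nn.Linear(dim, attn_dim)    # project spatial features
        self.sent_proj = nn.Linear(dim, attn_dim)    # project visual sentinel
        self.hidden_proj = nn.Linear(dim, attn_dim)  # project decoder hidden state
        self.score = nn.Linear(attn_dim, 1)          # scalar attention scores

    def forward(self, feats, sentinel, hidden):
        # feats:    (batch, k, dim) spatial image regions
        # sentinel: (batch, dim)    visual sentinel for this step
        # hidden:   (batch, dim)    decoder hidden state for this step
        h = self.hidden_proj(hidden).unsqueeze(1)                         # (batch, 1, attn_dim)
        region_scores = self.score(torch.tanh(self.feat_proj(feats) + h))  # (batch, k, 1)
        sent_score = self.score(
            torch.tanh(self.sent_proj(sentinel).unsqueeze(1) + h))         # (batch, 1, 1)

        # One softmax over the k image regions plus a sentinel "slot".
        alpha = F.softmax(torch.cat([region_scores, sent_score], dim=1), dim=1)  # (batch, k+1, 1)
        beta = alpha[:, -1]                                               # sentinel gate in [0, 1]

        spatial_context = (alpha[:, :-1] * feats).sum(dim=1)              # (batch, dim)
        # Adaptive context: rely on the sentinel when beta is large
        # (non-visual word), on the image regions otherwise.
        context = beta * sentinel + (1.0 - beta) * spatial_context
        return context, alpha
```

In this sketch the gate `beta` is simply the sentinel's share of the same softmax that scores the image regions, so attending to the sentinel directly competes with attending to the image.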